# OpenACC Performance Optimization 04.10.2017 J. Kraus (NVIDIA) #### **Outline** - Memory coalescing - Loop optimizations # **CSR** sparse matrix storage | | 0 | 1 | 2 | 3 | 4 | |---|----|----|----|----|----| | 0 | -2 | 1 | 0 | 0 | 0 | | 1 | 1 | -2 | 1 | 0 | 0 | | 2 | 0 | 1 | -2 | 1 | 0 | | 3 | 0 | 0 | 1 | -2 | 1 | | 4 | 0 | 0 | 0 | 1 | -2 | | | 0 | 1 | 2 | 3 | 4 | |---|----|----|----|----|----| | 0 | -2 | 1 | | | | | 1 | 1 | -2 | 1 | | | | 2 | | 1 | -2 | 1 | | | 3 | | | 1 | -2 | 1 | | 4 | | | | 1 | -2 | ### **Sparse Matrix Vector Product (SpMV)** ``` 42: #pragma acc parallel loop 43:for (int row=0; row<num rows; ++row) 44:{ 45: y[row] = 0.0; 46: const int row start = row ptr[row]; 47: const int row end = row ptr[row+1]; 48: for (int col idx=row start; col idx<row end; ++col idx)</pre> 49: 50: y[row] += val[col idx] * x[ col ptr[col idx] ]; 51: } 52: 53:} ``` ``` pgcc -fast -acc -ta=tesla -Minfo=accel spmv.c -o spmv main: 36, Generating copyin(row_ptr[:num_rows+1],col_ptr[:num_vals],val[:num_vals],x[:num_rows]) Generating copy(y[:num_rows]) 42, Accelerator kernel generated Generating Tesla code 43, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ 48, Complex loop carried dependence of y-> prevents parallelization Loop carried reuse of y-> prevents parallelization ./spmv Runtime 0.148565 s. ``` Disable usage of texture cache to see uncoaleseced memory accesses # **Memory Coalescing** - Coalesced access: - A group of 32 contiguous threads ("warp") accessing adjacent words - Few transactions and high utilization - Uncoalesced access: - A warp of 32 threads accessing scattered words - Many transactions and low utilization - For best performance threadIdx.x should access contiguously ### **OpenACC: 3 Levels of Parallelism** - Vector threads work in lockstep (SIMD/SIMT parallelism) - Workers have 1 or more vectors - Gangs have 1 or more workers and share resources (such as a cache, the SM, etc.) - Multiple gangs work independently of each other #### **CUDA Execution Model** #### Software Threads are executed by scalar processors Thread blocks are executed on multiprocessors Thread blocks do not migrate Several concurrent thread blocks can reside on one multiprocessor - limited by multiprocessor resources (shared memory and register file) **Block** A kernel is launched as a grid of thread blocks Blocks and grids can be multi dimensional (x,y,z) ### **CUDA Warps** A thread block consists of a groups of warps A warp is executed physically in parallel (SIMT) on a multiprocessor Currently all NVIDIA GPUs use a warp size of 32 # Mapping OpenACC to CUDA - The compiler is free to do what it wants - In general - (COARSE GRAIN) gang: mapped to blocks - (FINE GRAIN) worker: mapped to threads - vector: mapped to threads (FINE SIMD/SIMT) - Exact mapping is compiler dependent - Performance Tips - Use a vector size that is divisible by 32 - Block size is num\_workers \* vector\_length # OpenACC gang, worker, vector clauses - Gang, worker, vector can be added to a loop clause - Control the size using the following clauses on the parallel region - Parallel: num\_gangs(n), num\_workers(n), vector\_length(n) - Kernels: gang(n), worker(n), vector(n) ``` #pragma acc parallel loop gang worker for (int row=0; row<num_rows; ++row) { #pragma acc loop vector for (int col_idx=row_start; col_idx<row_end; ++col_idx)</pre> ``` gang, worker, vector appear once per parallel region # **Understanding Compiler Output** ``` 42, Accelerator kernel generated Generating Tesla code 43, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ ``` - Compiler is reporting how it is assigning work to the device - Gang is being mapped to blockldx.x - Vector is being mapped to threadIdx.x - Worker is not used - This application has a thread block size of 128 and launches as many blocks as necessary ### **SpMV** ``` 42: #pragma acc parallel loop Want this loop to parallelize 43:for (int row=0; row<num rows; ++row) with vector parallelism 44:{ 45: y[row] = 0.0; 46: const int row start = row ptr[row]; 47: const int row end = row ptr[row+1]; for (int col idx=row start; col idx<row end; ++col idx)</pre> 48: 49: 50: y[row] += val[col idx] * x[ col ptr[col idx] ]; 51: } ``` 48, Complex loop carried dependence of y-> prevents parallelization Loop carried reuse of y-> prevents parallelization #### **SpMV** ``` 42: #pragma acc parallel loop 43:for (int row=0; row<num rows; ++row) 44:{ 45: double y tmp = 0.0; 46: const int row start = row ptr[row]; 47: const int row end = row ptr[row+1]; for (int col idx=row start; col idx<row_end; ++col_idx)</pre> 48: 49: 50: y tmp += val[col idx] * x[ col ptr[col idx] ]; 51: Sum up in temporary 52: y[row] = y tmp; to remove loop 53:} carried dependency ``` ``` pgcc -fast -acc -ta=tesla -Minfo=accel spmv.c -o spmv main: 36, Generating copyin(row ptr[:num rows+1],col ptr[:num vals],val[:num vals],x[:num rows]) Generating copy(y[:num rows]) 42, Accelerator kernel generated Generating Tesla code 43, #pragma acc loop gang /* blockIdx.x */ 48, #pragma acc loop vector(128) /* threadIdx.x */ 50, Sum reduction generated for y tmp 48, Loop is parallelizable ./spmv Runtime 0.166006 s. ``` ``` 42: #pragma acc parallel loop 43:for (int row=0; row<num rows; ++row) gang 44: { 45: double y tmp = 0.0; vector (128) 46: const int row start = row ptr[row]; 47: const int row end = row ptr[row+1]; for (int col_idx=row_start; col_idx<row end; ++col idx)</pre> 48: 49: 50: y tmp += val[col idx] * x[col ptr[col idx]]; 42, Accelerator kernel generated Generating Tesla code 43, #pragma acc loop gang /* blockIdx.x */ 48, #pragma acc loop vector(128) /* threadIdx.x */ 50, Sum reduction generated for y tmp ``` Mitglied der Helmholtz-Gem # Providing more information to the compiler - We know that each row of the used Matrix has only 27 elements - Using 128 threads for 27 elements does not make sense - Let's tell the compiler to use fewer threads for each row ``` 42: #pragma acc parallel loop vector length (32) 43:for (int row=0; row<num rows; ++row) - gang 44: { 45: double y tmp = 0.0; vector(32) 46: const int row start = row ptr[row]; 47: const int row end = row ptr[row+1]; 48: for (int col_idx=row_start; col_idx<row end; ++col idx)</pre> 49: 50: y tmp += val[col idx] * x[col ptr[col idx]]; 42, Accelerator kernel generated Generating Tesla code 43, #pragma acc loop gang /* blockIdx.x */ 48, #pragma acc loop vector(32) /* threadIdx.x */ 50, Sum reduction generated for y tmp ``` Mitglied der Helmholtz-Gemei ``` pgcc -fast -acc -ta=tesla -Minfo=accel spmv.c -o spmv main: 36, Generating copyin(row ptr[:num rows+1],col ptr[:num vals],val[:num vals],x[:num rows]) Generating copy(y[:num rows]) 42, Accelerator kernel generated Generating Tesla code 43, #pragma acc loop gang /* blockIdx.x */ 48, #pragma acc loop vector(32) /* threadIdx.x */ 50, Sum reduction generated for y tmp 48, Loop is parallelizable ./spmv Runtime 0.119796 s. (was 0.166006 s) ``` # Keeping the code performance portable - The device\_type clause allows device specific tuning without harming performance portability - All clauses following a device\_type clause only apply for the given target: ``` #pragma acc parallel loop device_type(NVIDIA) vector_length(32) for (int row=0; row<num_rows; ++row) {</pre> ``` #### **Tasks** - Task 0: Coalescing memory accesses (repeat) - Task 1: Use vector\_length to improve the warp execution efficiency (repeat what was shown) - Task 2: Use the guided analysis to further improve the performance. - Hint: Add worker level parallelism to increase the block size to 128 threads (required to get full occupancy). ### **SpMV** ``` 42: #pragma acc parallel loop device type (NVIDIA) gang worker vector length (32) gang, worker(4) 43:for (int row=0; row<num rows; ++row) 44:{ vector (32) 45: double y tmp = 0.0; 46: const int row start = row ptr[row]; 47: const int row end = row ptr[row+1]; 48: for (int col idx=row start; col idx<row end; ++col idx) 49: 42, Accelerator kernel generated Generating Tesla code 43, #pragma acc loop gang, worker(4) /* blockIdx.x threadIdx.y */ 48, #pragma acc loop vector(32) /* threadIdx.x */ 50, Sum reduction generated for y tmp ``` # **Understanding Compiler Output (recap)** ``` 42, Accelerator kernel generated Generating Tesla code 43, #pragma acc loop gang, worker(4) /* blockIdx.x threadIdx.y */ 48, #pragma acc loop vector(32) /* threadIdx.x */ 50, Sum reduction generated for y_tmp ``` - Compiler is reporting how it is assigning work to the device - Gang is being mapped to blockldx.x - Worker is being mapped to threadIdx.y - Vector is being mapped to threadIdx.x - This application has a thread block size of 4x32 and launches as many blocks as necessary ``` pgcc -fast -acc -ta=tesla -Minfo=accel spmv.c -o spmv main: 36, Generating copyin(row ptr[:num rows+1],col ptr[:num vals],val[:num vals],x[:num rows]) Generating copy(y[:num rows]) 42, Accelerator kernel generated Generating Tesla code 43, #pragma acc loop gang, worker(4) /* blockIdx.x threadIdx.y */ 48, #pragma acc loop vector(32) /* threadIdx.x */ 50, Sum reduction generated for y tmp 48, Loop is parallelizable ./spmv Runtime 0.047039 s. (was 0.119796 s and 0.166006 s) ``` #### **Conclusions** - The NVIDIA Visual Profiler can be used to identify performance bottlenecks in OpenACC Kernels - Coalescing memory accesses is important for performance - Using loop clauses allows to provide runtime information (approximate length of matrix rows) to the compiler for better performance.